Using Regular Expressions in URL Filters

URL filters allow you to easily control Project downloads by setting which pictures/pages should be loaded and which should be skipped.

URL Filters are divided into four parts:

·Page URL Include Filters - determine which HTML pages should be accessed and analysed to follow their links.  
·Page URL Exclude Filters - determine which HTML pages should be skipped.  
·Picture URL Include Filters - determine which pictures should be downloaded.  
·Picture URL Exclude Filters - determine which pictures should be skipped.  

You may enter several keywords into each of these filter lists, using a semicolon (;) to separate them.
You can use Perl-like regular expressions as keywords. A regular expression is a string of characters that tells PicaLoader which URL (or URLs) you are looking for. The following explains the format of regular expressions in detail. If you are familiar with Perl, you already know the syntax.
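
To make the idea of semicolon-separated filter lists concrete, here is a minimal sketch in Python. It is not PicaLoader's implementation: the function names are hypothetical, Python's re module stands in for the Perl-like engine, and the rules that an empty include list admits everything and that an exclude match wins over an include match are assumptions.

```python
import re

def matches_any(filter_list: str, url: str) -> bool:
    """Return True if any semicolon-separated keyword matches the URL."""
    keywords = [k for k in filter_list.split(";") if k]
    return any(re.search(k, url, re.IGNORECASE) for k in keywords)

def should_load(url: str, include: str, exclude: str) -> bool:
    # Assumption: an empty include list admits every URL,
    # and an exclude match always overrides an include match.
    included = not include or matches_any(include, url)
    return included and not matches_any(exclude, url)

print(should_load("http://example.com/beatles/pic01.jpg", "beatles;u2", "thumb"))
print(should_load("http://example.com/beatles/thumb01.jpg", "beatles;u2", "thumb"))
```

The first URL passes the include list and avoids the exclude list; the second is rejected because "thumb" appears in it.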

1.Simple Regular Expressions:
In its simplest form, a regular expression is just a word or phrase to search for. For example,
beatles

would match any URL with the string "beatles" in it. Thus, URLs like "xxx.beatles.xxx", "xxx.music.xxx/beatles.htm" or "xxx.animal.xxx/beatleswild.htm" would all be matched.
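
A quick way to try this behaviour outside PicaLoader is Python's re.search, which also looks for the pattern anywhere in the string. The URL list below is made up for illustration, and the fourth entry is added only to show a non-match.

```python
import re

urls = [
    "xxx.beatles.xxx",
    "xxx.music.xxx/beatles.htm",
    "xxx.animal.xxx/beatleswild.htm",
    "xxx.music.xxx/stones.htm",
]

# Keep every URL that contains the keyword somewhere in it.
matched = [u for u in urls if re.search("beatles", u, re.IGNORECASE)]
print(matched)
```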

2.Metacharacters:
Some characters have a special meaning to the filter. These characters are called metacharacters. Although they may seem confusing at first, they add a great deal of flexibility and convenience to the filter.

The period (.) is a commonly used metacharacter. It matches exactly one character, regardless of what the character is. For example, the regular expression:
pic.01

will match "pic001" and "pic101". Note that the period matches exactly one character: it will not match a string of characters, nor will it match the null string. Thus, "picture01" and "pic01" will not be matched by the above regular expression.
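
The four names discussed above can be checked with a short Python sketch (Python's re module is assumed to treat the period the same way as PicaLoader's engine):

```python
import re

pattern = re.compile(r"pic.01", re.IGNORECASE)

# The period must consume exactly one character between "pic" and "01".
results = {name: bool(pattern.search(name))
           for name in ["pic001", "pic101", "pic01", "picture01"]}
print(results)
```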

But what if you wanted to match a URL containing a period? For example,
pic001.jpg

This would indeed match "pic001.jpg", but it would also match "pic001ajpg", "pic0011jpg" and so on. In short, any string of the form "pic001xjpg", where x is any character, would be matched by the regular expression above.
To get around this, we introduce a second metacharacter, the backslash (\). The backslash can be used to indicate that the character immediately to its right is to be taken literally. Thus, to match the string "pic001.jpg", we would use:
pic001\.jpg

This is called "quoting". We would say that the period in the regular expression above has been quoted. In general, whenever the backslash is placed before a metacharacter, the filter treats the metacharacter literally rather than invoking its special meaning.
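
The difference between the quoted and unquoted period can be seen in the following Python sketch. The call to re.escape is a Python convenience for quoting a literal string automatically; PicaLoader is not known to offer an equivalent, so filter keywords still need the backslash typed by hand.

```python
import re

unquoted = re.compile(r"pic001.jpg")
quoted = re.compile(r"pic001\.jpg")

print(bool(unquoted.search("pic001ajpg")))  # True: the period matches "a"
print(bool(quoted.search("pic001ajpg")))    # False: a literal period is required
print(bool(quoted.search("pic001.jpg")))    # True

# Python can produce the quoted form of a literal file name.
print(re.escape("pic001.jpg"))
```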

The question mark (?): indicates that the character immediately preceding it may occur either zero times or one time. Thus
pic0?1

will match "pic1" and "pic01".

The star (*): indicates that the character immediately to its left may be repeated any number of times, including zero. Thus
pic0*1

will match "pic1", "pic01", "pic001", "pic0001", and any string that starts with "pic", is followed by a sequence of "0"s, and ends with a "1".

The plus (+): indicates that the character immediately preceding it may be repeated one or more times. It is just like the star metacharacter, except it doesn't match the null string. Thus
pic0+1

would not match "pic1", but it would match "pic01", "pic001", "pic0001" and so on.
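
The question mark, star and plus differ only in how many repetitions they allow, which a side-by-side comparison makes easy to see. This is a sketch using Python's re module on three made-up names:

```python
import re

names = ["pic1", "pic01", "pic001"]

# For each quantifier, list the names in which the pattern is found.
results = {}
for pattern in [r"pic0?1", r"pic0*1", r"pic0+1"]:
    results[pattern] = [n for n in names if re.search(pattern, n)]
    print(pattern, results[pattern])
```

Only the star accepts all three names; the question mark rejects "pic001" and the plus rejects "pic1".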

Metacharacters may be combined. A common combination includes the period and star metacharacters, with the star immediately following the period. This is used to match an arbitrary string of any length, including the null string. For example:
pic.*1

would match "pic1", "pic01" and even "picture_001". Any string that starts with "pic", is followed by an arbitrary string, and ends with "1" will be matched. Note that the null string will be matched by the period-star pair; thus, "pic1" would be matched by the above expression.
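
Because the filter looks for the expression anywhere inside a URL, the period-star pair can span more text than expected. The Python sketch below (the example.com URL is invented) shows a match where "pic" and "1" are far apart:

```python
import re

pattern = re.compile(r"pic.*1", re.IGNORECASE)

texts = ["pic1", "pic01", "picture_001", "pic2",
         "http://example.com/pics/a2.jpg?v=1"]
results = {t: bool(pattern.search(t)) for t in texts}
for text, hit in results.items():
    print(text, hit)
```

The last URL matches because "pic" occurs in "pics" and a "1" appears at the very end, so a more specific expression is usually a better filter.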

3.The backslash also works the other way around: besides quoting metacharacters, it can turn certain ordinary characters into metacharacters.

The digit metacharacter: is invoked by following a backslash with a lower-case "d", like this: "\d". The "d" must be lower case. The digit metacharacter matches exactly one digit; that is, exactly one occurrence of "0", "1", "2", "3", "4", "5", "6", "7", "8" or "9". For example, the regular expression:
pic\d\.jpg

would match "pic0.jpg", "pic1.jpg" and so forth. Similarly,
pic\d\d\.jpg

would match "pic00.jpg", "pic01.jpg" and so on up to "pic99.jpg".
We could combine the digit metacharacter with other metacharacters; for instance,
pic\d+\.jpg

matches any string starting with "pic", followed by one or more digits, followed by ".jpg". (Note that the plus is used, and thus "pic.jpg" is not matched.)
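
The three digit patterns above can be compared on a handful of file names. This is a Python sketch with invented names; for plain ASCII digits Python's "\d" behaves as described here.

```python
import re

names = ["pic.jpg", "pic7.jpg", "pic42.jpg", "pic2024.jpg"]

# Map each pattern to the file names it is found in.
results = {}
for pattern in [r"pic\d\.jpg", r"pic\d\d\.jpg", r"pic\d+\.jpg"]:
    results[pattern] = [n for n in names if re.search(pattern, n)]
    print(pattern, results[pattern])
```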

The non-digit metacharacter: uses the upper-case "D". It looks like "\D" and matches any one character except a digit. Thus,
pic\D\.jpg

would match "pica.jpg", "picZ.jpg" or "pic+.jpg", but would not match "pic1.jpg", "pic5.jpg" or "pic9.jpg". Similarly,
\D+

matches any non-null string which contains no numeric characters.
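
One subtlety is worth a sketch: when "\D+" is searched for inside a longer string, it matches a run of non-digits within that string, so a name that also contains digits can still match. The Python example below uses re.fullmatch, a Python-only function that requires the whole string to match, to show the difference.

```python
import re

one_non_digit = re.compile(r"pic\D\.jpg")
print(bool(one_non_digit.search("pica.jpg")))  # True
print(bool(one_non_digit.search("pic1.jpg")))  # False

# Searching finds the first run of non-digits inside the name.
run = re.search(r"\D+", "pic123.jpg")
print(run.group())  # pic

# Requiring the whole string to match rejects names containing digits.
print(bool(re.fullmatch(r"\D+", "wallpaper")))   # True
print(bool(re.fullmatch(r"\D+", "wallpaper7")))  # False
```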

The word metacharacter: matches exactly one letter, one number, or the underscore character (_). It is written as "\w". Its opposite, "\W", matches any one character except a letter, a number or the underscore. Thus,
a\wz

would match "abz", "aTz", "a5z", "a_z", or any three-character string starting with "a", ending with "z", and whose second character was either a letter (upper- or lower-case), a number, or the underscore. Similarly,
a\Wz

would not match "abz", "aTz", "a5z", or "a_z". It would match "a%z", "a{z", "a?z" or any three-character string starting with "a" and ending with "z" and whose second character was not a letter, number, or underscore. (This means the second character must either be a symbol or a whitespace character.)
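
The two patterns split any list of three-character candidates into complementary groups, as this Python sketch shows (re.fullmatch is used so that only whole three-character strings are tested):

```python
import re

candidates = ["abz", "aTz", "a5z", "a_z", "a%z", "a{z", "a z"]

# "\w" accepts letters, digits and the underscore; "\W" accepts everything else.
word = [c for c in candidates if re.fullmatch(r"a\wz", c)]
non_word = [c for c in candidates if re.fullmatch(r"a\Wz", c)]
print(word)
print(non_word)
```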

The braces metacharacter: This metacharacter follows a normal character and contains two numbers separated by a comma (,) and surrounded by braces ({}). It is like the star metacharacter, except the length of the string it matches must be within the minimum and maximum length specified by the two numbers in braces. Thus,
pic0{3,5}\.jpg

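will match "pic000.jpg", "pic0000.jpg" and "pic00000.jpg", that is, "pic" followed by three to five zeros and ".jpg". Names with fewer than three or more than five zeros are not matched. Likewise,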
pic.{3,5}\.jpg

will match "pic000.jpg", "pic99999.jpg" or "picabc.jpg", but not "pic00.jpg", since "00" is only two characters long.
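
To see exactly which repetition counts the braces accept, the following Python sketch builds file names with two to six zeros and tests each one. The generated names are for illustration only.

```python
import re

zeros = re.compile(r"pic0{3,5}\.jpg")

# Try names with 2, 3, 4, 5 and 6 zeros after "pic".
matched_counts = [n for n in range(2, 7)
                  if zeros.search(f"pic{'0' * n}.jpg")]
print(matched_counts)

any_chars = re.compile(r"pic.{3,5}\.jpg")
print(bool(any_chars.search("picabc.jpg")))  # True: three characters
print(bool(any_chars.search("pic00.jpg")))   # False: only two characters
```

Every count from the minimum to the maximum is accepted, not just the two endpoints.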

The alternative metacharacter: is represented by a vertical bar (|). It indicates an either/or behavior by separating two or more possible choices. For example:
beatles|u2

will match any URL containing the strings "beatles" or "u2" or both.
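
The vertical bar splits the whole expression into alternatives. To limit the choice to part of an expression, the alternatives can be wrapped in parentheses, as in Perl. The Python sketch below shows both forms; the URLs are invented, and it is an assumption that PicaLoader handles the parenthesised form the same way.

```python
import re

either = re.compile(r"beatles|u2", re.IGNORECASE)
urls = ["music.example/beatles/cover.jpg",
        "music.example/u2/live.jpg",
        "music.example/queen/cover.jpg"]
matched = [u for u in urls if either.search(u)]
print(matched)

# Parentheses restrict the alternatives to "01" or "02".
grouped = re.compile(r"pic(01|02)\.jpg", re.IGNORECASE)
print(bool(grouped.search("pic01.jpg")))  # True
print(bool(grouped.search("pic03.jpg")))  # False
```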

The bracket metacharacter: matches one occurrence of any character inside the brackets ([]). For example,
pic_[abf]\.jpg

will match "pic_a.jpg", "pic_b.jpg" and "pic_f.jpg", but not "pic_0.jpg", "pic_c.jpg" or "pic_e.jpg".
Ranges of characters can be specified with a dash (-) within the brackets. For example,
pic[a-d]\.jpg

will match "pica.jpg", "picb.jpg", "picc.jpg" or "picd.jpg", and nothing else. Likewise,
wallpaper[3-5]\d\.jpg

will match "wallpaper30.jpg" through "wallpaper59.jpg".
If you wish to include a dash within brackets as one of the characters to match, instead of to denote a range, put the dash immediately before the right bracket. Thus:
a[1234-]z

and
a[1-4-]z

both do the same thing. They both match "a1z", "a2z", "a3z", "a4z" or "a-z", and nothing else.
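
The bracket examples above, including the claim that the two dash forms are equivalent, can be checked with a short Python sketch (the candidate strings are chosen for illustration):

```python
import re

candidates = ["a1z", "a4z", "a-z", "a5z", "abz"]

# A dash placed just before the closing bracket is taken literally.
listed = [c for c in candidates if re.fullmatch(r"a[1234-]z", c)]
ranged = [c for c in candidates if re.fullmatch(r"a[1-4-]z", c)]
print(listed)
print(ranged)

tens = re.compile(r"wallpaper[3-5]\d\.jpg")
names = ["wallpaper29.jpg", "wallpaper30.jpg", "wallpaper59.jpg", "wallpaper60.jpg"]
in_range = [n for n in names if tens.search(n)]
print(in_range)
```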

The bracket metacharacter can also be inverted by placing a caret (^) immediately after the left bracket. Thus,
wallpaper[^02468]\.jpg

matches "wallpaper" followed by any one character except an even digit, followed by ".jpg" (for example "wallpaper1.jpg", but not "wallpaper2.jpg"). Inversion and ranges can be combined, so that
\W[^f-h]ood\W

matches any four-character word ending in "ood", with a non-word character on each side, except for "food", "good" or "hood". (Thus "mood" and "wood" would both be matched.)
Note that within brackets, ordinary quoting rules do not apply and other metacharacters are not available. The only characters that can be quoted in brackets are "[", "]", and "\". Thus,
[\[\\\]]abc

matches any four-character string ending with "abc" and starting with "[", "]", or "\".
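
Inverted brackets and quoting inside brackets are easy to get wrong, so here is a Python sketch that exercises the three expressions above. The sample strings are invented, and Python's re module is assumed to agree with PicaLoader for these patterns.

```python
import re

not_even = re.compile(r"wallpaper[^02468]\.jpg")
print(bool(not_even.search("wallpaper1.jpg")))  # True
print(bool(not_even.search("wallpaper2.jpg")))  # False

# The word must be surrounded by non-word characters such as spaces.
ood = re.compile(r"\W[^f-h]ood\W")
print(bool(ood.search("in a mood today")))  # True
print(bool(ood.search("some food here")))   # False

# Inside brackets, "[", "]" and "\" are quoted with a backslash.
quoted_set = re.compile(r"[\[\\\]]abc")
hits = [s for s in ["[abc", "]abc", "\\abc", "xabc"] if quoted_set.search(s)]
print(hits)
```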

4.The table below lists some of the more useful special (meta) characters.

Reg-expr     Description
.            Matches any character (except newline)
x?           Matches 0 or 1 x's, where x is any regular expression
x*           Matches 0 or more x's
x+           Matches 1 or more x's
foo|bar      Matches one of foo or bar
[xyz]        Matches any character in the set xyz, specify ranges with a -
[^xyz]       Matches any single character not in the set xyz
\w           Matches an alpha-numeric character, i.e., [a-zA-Z0-9_]
(x)          Brackets a regular expression
\metachar    Matches the metacharacter (takes away its special meaning)


5.The search is case insensitive;
thus
picture
and
Picture
and
PICTURE
all search for the same set of strings. Each will match "picture", "PICTURE", "Picture", "PicTure" and so forth. Thus you need not worry about capitalization. (Note, however, that metacharacters must still have the proper case. This is especially important for metacharacters whose case determines whether their meaning is reversed or not.)
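
In Python the same behaviour is obtained with the re.IGNORECASE flag, which makes literal text case insensitive but leaves the meaning of "\d" and "\D" untouched. A minimal sketch, with an invented URL:

```python
import re

url = "http://example.com/PicTure01.jpg"

# All three spellings of the keyword find the same URL.
results = {k: bool(re.search(k, url, re.IGNORECASE))
           for k in ["picture", "Picture", "PICTURE"]}
print(results)

# The case of the metacharacter itself still matters.
print(bool(re.search(r"pic\d", "PIC7", re.IGNORECASE)))  # True: digit follows
print(bool(re.search(r"pic\D", "PIC7", re.IGNORECASE)))  # False: "7" is a digit
```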